Abstract. When humans describe images they tend to use combinations of nouns
and adjectives, corresponding to objects and their associated attributes
respectively. To generate such a description automatically, one needs to model
objects, attributes and their associations. Conventional methods require strong
annotation of object and attribute locations, making them less scalable. In
this paper, we model object-attribute associations from weakly labelled images,
such as those widely available on media sharing sites (e.g. Flickr), where only
image-level labels (either object or attributes) are given, without their
locations and associations. This is achieved by introducing a novel weakly
supervised non-parametric Bayesian model. Once learned, given a new image, our
model can describe the image, including objects, attributes and their
associations, as well as their locations and segmentation. Extensive
experiments on benchmark datasets demonstrate that our weakly supervised model
performs on par with strongly supervised models on tasks such as image
description and retrieval based on object-attribute associations.

Keywords: Weakly supervised learning, object attribute associations 


1 Introduction 


Vision research is moving beyond simple classification, annotation and detection
to encompass generating more structured and semantic descriptions of images.
When humans describe images they use combinations of nouns and adjectives,
corresponding to objects and their associated attributes respectively. For
example, an image can be described as containing “a person in red clothes and a
shiny car”. In order to imitate this ability, a computer vision system needs to
learn models of objects, attributes, and their associations. Object-attribute
associations are important for avoiding mistakes such as “a shiny person and a
red car”. Learning object-attribute association also provides new query
capabilities, e.g., “find images with a furry brown horse and a red shiny car”.


There has been extensive work on detecting and segmenting objects [33, 6, 31]
and describing specified objects and images in terms of semantic attributes.
However, these tasks have previously been treated separately;
jointly learning about and inferring object-attribute association in images with 
potentially multiple objects is much less studied. The few existing studies on

Fig. 1: Comparing our weakly supervised approach to object-attribute
association learning to the conventional strongly supervised approach.

In the conventional pipeline (Fig. 1), images are strongly labelled with object
bounding boxes and associated attributes, from which object detectors and
attribute classifiers are trained. Given a new image, the learned object
detectors are first applied to find object locations, where the attribute
classifiers are then applied to produce the object descriptions. However, there
is a critical limitation of the existing approach: it requires strongly
labelled objects and attributes. Considering there are over 30,000 object
classes distinguishable to humans [18], even more attributes to describe them,
and an infinite number of combinations, it is not scalable.


In this paper we propose to learn objects, attributes, and their associations
from weakly labelled data, that is, images with object and attribute labels but
neither their associations nor their locations (see Fig. 1). Such weakly
labelled images are abundant on media sharing websites such as Flickr, so a
lack of training data is unlikely to be a problem. However, learning strong
semantics, i.e. explicit object-attribute association, from weakly labelled
images is extremely challenging due to label ambiguity: a real-world image with
the tags “dog, white, coat, furry” could contain a furry dog and a white coat,
or a furry coat and a white dog. Furthermore, the tags/labels typically only
describe the foreground/objects. There could be a white building in the
background which is ignored by the annotator, and a computer vision model must
infer that this is not what the tag ‘white’ refers to. Conventional methods
cannot be applied without object locations and explicit object-attribute
association being labelled.


To address the challenges of learning strong semantics from weak annotation,
we develop a unified probabilistic generative model capable of jointly learning
objects, attributes and their associations, as well as their location and
segmentation. Our model is also able to learn from realistic images where there
are multiple objects of variable sizes per image, such as PASCAL VOC. More
specifically, our model generalises the non-parametric Indian Buffet Process
(IBP) [13]. The IBP is chosen because it is designed for explaining multiple
factors that simultaneously co-exist to account for the appearance of a
particular image or patch, e.g., such factors can be an object and its
particular texture and colour attributes. However, the conventional IBP is
limited in that it is unsupervised and, as a flat model, applies to either
patches or images, not both; it thus cannot be directly applied to our problem.
To overcome these limitations, a novel model termed Weakly Supervised Stacked
Indian Buffet Process (WS-SIBP) is formulated in this work. By introducing
hierarchy into IBP, WS-SIBP is able to group data, thus allowing it to explain
images as groups of patches, each of which has an inferred multi-label
description vector corresponding to an object and its associated attributes. We
also introduce weak image-level supervision, which is disambiguated into
multi-label patch explanations by our WS-SIBP.

Modelling weakly labelled images using our framework provides a number
of benefits: (i) by jointly learning multiple objects, attributes and background
clutter in a single framework, ambiguity in each is explained away by knowledge
of the other; (ii) the infinite number of factors provided by the non-parametric
Bayesian framework allows structured background clutter of unbounded
complexity to be explained away; (iii) a sparse binary latent representation of
each patch allows an unlimited number of attributes to co-exist on one object.
The aims and capabilities of our approach are illustrated schematically in
Fig. 1, where weak annotation in the form of a mixture of objects and
attributes is transformed into object and attribute associations with locations.


2 Related work 


Learning objects and attributes A central task in computer vision is
understanding image content. Such an understanding has been shown in the form
of an image description in terms of nouns (object detection or region
segmentation), and more recently adjectives (visual attributes) [7]. Attributes
have been used to describe objects [9, 34], people, clothing [4], scenes [36],
faces [32], and video events [12]. However, most previous studies have learned
and inferred object and attribute models separately, e.g., by independently
training binary classifiers, and require strong annotations/labels indicating
object/attribute locations and/or associations if the image is not dominated by
a single object.
Learning object-attribute associations A few recent studies have learned
object-attribute association explicitly. Different from our approach,
[35, 37, 38, 20] only train and test on unambiguous data, i.e. images
containing a single dominant object; assume object-attribute association is
known at training; and moreover allocate exactly one attribute per object.
[17] tests on more challenging PASCAL VOC data with multiple objects and
attributes coexisting. However, their model is pre-trained on object and
attribute detectors learned on strongly annotated images with object bounding
boxes provided. [36] also performs object segmentation and object-attribute
prediction, but their model is learned from strongly labelled images in that
object-attribute associations are given during training; and, importantly,
prediction is restricted to object-attribute pairs seen during training. In
summary, none of the existing work learns object-attribute association from
weakly labelled data as we do here.


Multi-attribute query Some existing work aims to perform attribute-based
query [26, 15, 30]. In particular, recent studies have considered how to
calibrate [30] and fuse [26] multiple attribute scores in a single query. We go
beyond these studies in supporting conjunction of object+multi-attribute query.
Moreover, existing methods either require bounding boxes or assume simple
data with single dominant objects, and do not reason jointly about multiple
attribute-object association. This means they would be intrinsically challenged
in reasoning about (multi)-attribute-object queries on challenging data with
multiple objects and multiple attributes in each image (e.g., querying furry
brown horse, in a dataset with black horses and furry dogs in the same image).
In other words, they cannot be directly extended to solve query by
object-attribute association.


Probabilistic models for image understanding Discriminative kernel methods
underpin many high performance recognition and annotation studies [9, 27].
However, the flexibility of generative probabilistic models has seen them
successfully applied to a variety of tasks, especially learning structured
scene representations and weakly-supervised learning [31, 19]. These studies
often generalise probabilistic topic models (PTMs). However, PTMs are limited
for explaining objects and attributes in that latent topics are competitive:
the fundamental assumption is that an object is a horse or brown or furry. They
intrinsically do not account for the reality that it is all at once.


We therefore generalise instead the Indian Buffet Process (IBP) [13]. The
IBP is a latent feature model that can independently activate each latent factor,
explaining imagery as a weighted sum of active factor appearances. However,
conventional IBP is (i) fully unsupervised, and (ii) only handles flat data. Thus,
it could explain patches or images, but not images composed of patches, thereby 
limiting usefulness for multiple object-attribute association within images. We 
therefore formulate a novel Weakly Supervised Stacked Indian Buffet Process 
(WS-SIBP) to model grouped data (images composed of patches), such that 
each patch has an infinite latent feature vector. This allows us to exploit image- 
level weak supervision, but disambiguate it to determine the best explanation 
in terms of which patches correspond to un-annotated background; which patch 
corresponds to which annotated object; and which objects have which attributes. 


Weakly supervised learning Weakly supervised learning (WSL) has attracted
increasing attention as the volume of data which we are interested in learning
from grows much faster than available annotation. Existing studies have
generally focused on WSL of objects alone, with limited work on WSL of
attributes. Some studies have treated this as a discriminative multi-instance
learning (MIL) problem [22], while others leveraged PTMs [31]. Weakly
supervised localisation is a particularly challenging variant where images are
annotated with objects, but the absence of bounding boxes means their location
is unknown. This has been solved by sampling bounding boxes for MIL treatment,
or more ‘softly’ by PTMs [31]. In this paper we uniquely consider WSL of
objects, attributes, their associations and their locations simultaneously.
Our contributions In this paper we make three key contributions: (i) For the
first time, we jointly learn all object, attribute and background appearances,
object-attribute association, and their locations from realistic weakly labelled
images; (ii) We formulate a novel weakly supervised non-parametric Bayesian
model by generalising the Indian Buffet Process; (iii) From this weakly labelled
data, we demonstrate various image description and query tasks, including
challenging tasks relying on predicting strong object-attribute association.
Extensive
experiments on benchmark datasets demonstrate that in each case our model is 
comparable to the strongly supervised alternatives and significantly outperforms 
a number of weakly supervised baselines. 


3 Weakly Supervised Stacked Indian Buffet Process 

We propose a non-parametric Bayesian model that learns to describe images 
composed of super-pixels/patches from weak object and attribute annotation. 
Each patch is associated with an infinite latent factor vector indicating if it 
corresponds to (an unlimited variety of) unannotated background clutter, or an 
object of interest, and what set of attributes are possessed by the object. Given a 
set of images with weak labels and segmented into super-pixels/patches, we need 
to learn: (i) which are the unique patches shared by all images with a particular 
label, (ii) which patches correspond to unannotated background, and (iii) what 
is the appearance of each object, attribute and background type. Moreover, since 
multiple labels (attribute and object) can apply to a single patch, we need to 
disambiguate which aspects of the appearance of the patch are due to each 
of the (unknown) associated object and attribute labels. To address all these
learning tasks we build on the IBP and introduce a weakly-supervised stacked
Indian Buffet Process (WS-SIBP) to model data represented as bags (images) of
instances (patches) with bag-level labels (image annotations). This is analogous
to the notion of documents in topic models.

3.1 Model formulation 

First, we associate each object category and each attribute to a latent factor.
If there are $K_o$ object categories and $K_a$ attributes, then the first
$K_{oa} = K_o + K_a$ latent factors correspond to these. An unbounded number of
further factors are available to explain away background clutter in the data.
At training time, we assume a binary label vector $L^{(i)}$ for objects and
attributes is provided for each image $i$. So $L_k^{(i)} = 1$ if
attribute/object $k$ is present, and zero otherwise. Also $L_k^{(i)} = 1$ for
all $k > K_{oa}$. That is, without any labels, we assume all background types
can be present. With these assumptions, the generative process (illustrated in
Fig. 2) for image $i$ represented as bags of patches is as follows:

Fig. 2: The graphical model for WS-SIBP. Shaded nodes are observed.

For each latent factor $k \in 1 \ldots \infty$:

1. Draw an appearance distribution mean $A_{k\cdot} \sim \mathcal{N}(0, \sigma_A^2 I)$.

For each image $i \in 1 \ldots M$:

1. Draw a sequence of i.i.d. random variables $v_1^{(i)}, v_2^{(i)}, \ldots \sim \mathrm{Beta}(\alpha, 1)$,

2. Construct an image prior $\pi_k^{(i)} = \prod_{t=1}^{k} v_t^{(i)}$,

3. Input weak annotation $L_k^{(i)} \in \{0, 1\}$,

4. For each super-pixel patch $j \in 1 \ldots N_i$:

(a) Sample state of each latent factor $k$: $z_{jk}^{(i)} \sim \mathrm{Bern}(\pi_k^{(i)} L_k^{(i)})$,

(b) Sample patch appearance: $X_{j\cdot}^{(i)} \sim \mathcal{N}(z_{j\cdot}^{(i)} A, \sigma^2 I)$,

where $\mathcal{N}$, Bern and Beta respectively correspond to Normal, Bernoulli
and Beta distributions with the specified parameters; and the notation
$X_{j\cdot}$ means the vector of row $j$ in matrix $X$. The Beta-Bernoulli and
Normal-Normal conjugacy are chosen because they allow more efficient inference.
$\alpha$ is the prior expected sparsity of annotations and $\sigma_A^2$ is the
prior variance in appearance for each factor.
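
The generative process above can be illustrated by a minimal sampling sketch.
It assumes a finite truncation of the factor set and uses only the Python
standard library; the function names `sample_appearance` and `sample_image`,
and the default values of `alpha`, `sigma` and `sigma_a`, are hypothetical
choices for illustration and are not taken from the paper.

```python
import random


def sample_appearance(k_max, dim, sigma_a=1.0, rng=None):
    """Draw one appearance mean A_k ~ N(0, sigma_a^2 I) per latent factor."""
    rng = rng or random.Random()
    return [[rng.gauss(0.0, sigma_a) for _ in range(dim)] for _ in range(k_max)]


def sample_image(labels, appearance, n_patches, alpha=2.0, sigma=0.1, rng=None):
    """Sample one image (a bag of patches) from a truncated version of the model.

    labels[k] is the weak annotation L_k (1 = factor k may appear in the image);
    appearance[k] is the appearance mean A_k of factor k.
    """
    rng = rng or random.Random()
    k_max = len(labels)
    dim = len(appearance[0])

    # Stick-breaking image prior: pi_k = prod_{t<=k} v_t with v_t ~ Beta(alpha, 1).
    pi, stick = [], 1.0
    for _ in range(k_max):
        stick *= rng.betavariate(alpha, 1.0)
        pi.append(stick)

    z, x = [], []
    for _ in range(n_patches):
        # Each factor is switched on independently; the weak label gates the prior.
        z_j = [1 if rng.random() < pi[k] * labels[k] else 0 for k in range(k_max)]
        # Patch appearance: Gaussian noise around the sum of active factor means.
        mean = [sum(z_j[k] * appearance[k][d] for k in range(k_max))
                for d in range(dim)]
        z.append(z_j)
        x.append([rng.gauss(m, sigma) for m in mean])
    return pi, z, x
```

Because the label multiplies the Bernoulli parameter, a factor whose label is
zero can never be active on any patch of that image, which is how the weak
annotation constrains the patch-level explanation.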

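Denote hidden variables by $H = \{\pi^{(1)}, \ldots, \pi^{(M)}, Z^{(1)}, \ldots, Z^{(M)}, A\}$,
images by $X = \{X^{(1)}, \ldots, X^{(M)}\}$, and parameters by
$\Theta = \{\alpha, \sigma_A, \sigma, L\}$. Then the joint probability of the
variables and data given the parameters is:

$$p(H, X \mid \Theta) = \prod_{i=1}^{M} \Bigg( \prod_{k=1}^{\infty} \Big( p(\pi_k^{(i)} \mid \alpha) \prod_{j=1}^{N_i} p(z_{jk}^{(i)} \mid \pi_k^{(i)}, L_k^{(i)}) \Big) \cdot \prod_{j=1}^{N_i} p(X_{j\cdot}^{(i)} \mid z_{j\cdot}^{(i)}, A, \sigma) \Bigg) \prod_{k=1}^{\infty} p(A_{k\cdot} \mid \sigma_A^2) \qquad (1)$$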

Learning in our model aims to compute the posterior $p(H \mid X, \Theta)$ for:
disambiguating and localising all the annotated objects and attributes among
the patches (inferring $Z^{(i)}$), inferring the attribute and background prior
for each image (inferring $\pi^{(i)}$), and learning the appearance of each
factor (inferring $A_{k\cdot}$).


3.2 Model learning 

Exact inference for $p(H \mid X, \Theta)$ in our stacked IBP is intractable, so
an approximate variational inference algorithm is developed. The mean field
variational approximation to the desired posterior $p(H \mid X, \Theta)$ is:

$$q(H) = \prod_{i=1}^{M} \Big( q_\tau(v^{(i)})\, q_\nu(Z^{(i)}) \Big)\, q_\phi(A) \qquad (2)$$

where $q_\tau(v_k^{(i)}) = \mathrm{Beta}(v_k^{(i)}; \tau_{k1}^{(i)}, \tau_{k2}^{(i)})$,
$q_\nu(z_{jk}^{(i)}) = \mathrm{Bernoulli}(z_{jk}^{(i)}; \nu_{jk}^{(i)})$,
$q_\phi(A_{k\cdot}) = \mathcal{N}(A_{k\cdot}; \phi_k, \Phi_k)$,
and the infinite stick-breaking process for latent factors is truncated at
$K_{max}$, so $\pi_k = 0$ for $k > K_{max}$. A variational message passing
(VMP) strategy can be used to minimise the KL divergence of Eq. (2) to the true
posterior. Updates are obtained by deriving integrals of the form
$\ln q(h) = \mathbb{E}_{H \setminus h}[\ln p(H, X)] + C$ for each group of
hidden variables $h$. These result in the series of iterative updates given in
Algorithm 1, where $\psi(\cdot)$ is the digamma function, and the term
$\mathbb{E}_v[\log(1 - \prod_{t=1}^{k} v_t)]$ is evaluated as in prior work on
variational inference for the IBP. In practice, the truncation approximation
means that our WS-SIBP runs with a finite number of factors $K_{max}$, where
the truncation factor $K_{max}$ can be freely set so long as it is bigger than
the number of factors needed by both annotations and background clutter
($K_{bg}$), i.e., $K_{max} > K_o + K_a + K_{bg}$. Despite the combinatorial
nature of the object-attribute association and localisation problem, our model
is of complexity $O(M N D K_{max})$ for $M$ images with $N$ patches, $D$
feature dimension and $K_{max}$ truncation factor.
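
As a small worked illustration of the truncated stick-breaking prior, the
sketch below computes the expected image prior under a factorised Beta
posterior. It relies only on two standard facts: the mean of
$\mathrm{Beta}(a, b)$ is $a / (a + b)$, and the expectation of a product of
independent variables is the product of their expectations. The function name
`expected_image_prior` and the pairing of the two Beta parameters as
`(tau1, tau2)` are assumptions made for this example, not the paper's notation
or update equations.

```python
def expected_image_prior(tau, k_max=None):
    """Expected pi_k = prod_{t<=k} v_t for k = 1..k_max.

    tau is a list of (tau1, tau2) pairs, one per stick variable v_t, where
    each v_t is assumed to follow an independent Beta(tau1, tau2) distribution.
    Factors beyond the truncation level are treated as having zero prior.
    """
    if k_max is None:
        k_max = len(tau)
    prior, running = [], 1.0
    for tau1, tau2 in tau[:k_max]:
        running *= tau1 / (tau1 + tau2)  # mean of Beta(tau1, tau2)
        prior.append(running)
    return prior
```

The returned sequence is non-increasing in `k`, which reflects why later
factors (used for background clutter) are a priori less likely to be active
than earlier ones.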


3.3 Inference for test data 

At testing time, the appearance of each factor $k$, now modelled by sufficient
statistics, is assumed to be known (learned from the training data), while
annotations for each test image will need to be inferred. Thus Algorithm 1
still applies, but without the appearance update terms and with
$L_k^{(i)} = 1\ \forall k$, to reflect the fact that all the learned object,
attribute, and background types could be present without any prior knowledge.


3.4 Applications of the model 

Given the learned model applied to testing data, we can perform the following
tasks:

Free Annotation: This is to describe an image using a list of nouns and
adjectives corresponding to objects and their associated attributes, as well as
locating them. To infer what objects are present in image $i$, the first $K_o$
latent factors of the inferred $\pi^{(i)}$ are thresholded or ranked to obtain
a list of objects. This is followed by locating them via searching for the
patches $j^*$ maximising $z_{jk}^{(i)}$, then thresholding or ranking the $K_a$
attribute latent factors in $z_{j^*\cdot}^{(i)}$ to describe them.

Algorithm 1: Variational Inference for WS-SIBP (the variational updates are
iterated until convergence).

Annotation given object names: This is a more constrained variant of
the free annotation task above. Given a named (but not located) object $k$, its
associated attributes can be estimated by first finding the location as
$j^* = \arg\max_j z_{jk}^{(i)}$, then the associated attributes by
$z_{j^*k}^{(i)}$ for $K_o < k \leq K_o + K_a$.

Object+Attribute Query: Images can be queried for a specified
object-attribute conjunction $\langle k_o, k_a \rangle$ by searching for
$i^* = \arg\max_{i,j} z_{jk_o}^{(i)} \cdot z_{jk_a}^{(i)}$.
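
The three read-out procedures can be sketched as follows, assuming that
inference has already produced, for each image, a patch-by-factor matrix `nu`
of posterior activations (a list of rows, one per patch). The zero-based factor
layout (objects first, then attributes) and the function names are hypothetical
conventions for this sketch rather than the authors' implementation.

```python
def locate(nu, k):
    """Index j* of the patch with the highest activation for factor k."""
    return max(range(len(nu)), key=lambda j: nu[j][k])


def describe_object(nu, k, n_obj, n_attr, top_t=2):
    """Locate object k, then rank the attribute factors on that patch."""
    j_star = locate(nu, k)
    attrs = range(n_obj, n_obj + n_attr)
    ranked = sorted(attrs, key=lambda a: nu[j_star][a], reverse=True)
    return j_star, ranked[:top_t]


def query_score(nu, k_obj, k_attr):
    """Best co-activation of an object and an attribute on a single patch."""
    return max(row[k_obj] * row[k_attr] for row in nu)


def rank_images(nus, k_obj, k_attr):
    """Order image indices by decreasing object-attribute query score."""
    return sorted(range(len(nus)),
                  key=lambda i: query_score(nus[i], k_obj, k_attr),
                  reverse=True)
```

Scoring the product on a single patch, rather than multiplying image-level
scores, is what prevents an image with a furry dog and a white coat from
matching a query for a white dog.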


4 Experiments 


Datasets: 

Various object and attribute datasets are available such as aPascal, ImageNet,
SUN [24] and AwA [18]. We use aPascal because it has multiple objects per
image; and ImageNet due to sharing attributes widely across categories.

Fig. 3: Strong bounding-box-level annotation and weak image-level annotations
for aPascal are used for learning strongly supervised models and weakly
supervised models respectively. Example bounding-box-level annotation:
Person 1: head, cloth, arm; Person 2: head, cloth; Aeroplane: metal, wing. The
corresponding image-level annotation: person, head, cloth, arm, aeroplane,
metal, wing.

Fig. 4: 43 subordinate classes of dog are converted into a single entry-level
class ‘dog’.


aPascal: This dataset is an attribute labelled version of PASCAL VOC
2008. There are 4340 images of 20 object categories. Each object is annotated
with a list of 64 attributes that describe it by shape (e.g., isBoxy), parts
(e.g., hasHead) and material (e.g., isFurry). In the original aPascal,
attributes are strongly labelled for 12695 object bounding boxes, i.e. the
object-attribute associations are given. To test our weakly supervised
approach, we merge the object-level category annotations and attribute
annotations into a single annotation vector of length 84 for the entire image.
This image-level annotation is much weaker than the original
bounding-box-level annotation, as shown in Fig. 3. In all experiments, we use
the standard train/test splits provided with the dataset.
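
The merging of bounding-box-level annotation into a single image-level vector
can be sketched as below. The function name `image_level_labels` and the input
layout (a list of `(object, attributes)` pairs per image) are assumptions for
illustration; with the 20 object categories and 64 attributes of aPascal the
resulting vector would have length 84.

```python
def image_level_labels(boxes, object_names, attribute_names):
    """Collapse per-box (object, attributes) annotations into one binary vector.

    The vector lists object entries first, followed by attribute entries.
    Which attribute belonged to which box, and where the boxes were, is
    deliberately discarded.
    """
    index = {name: i for i, name in enumerate(object_names + attribute_names)}
    labels = [0] * len(index)
    for obj, attrs in boxes:
        labels[index[obj]] = 1
        for attr in attrs:
            labels[index[attr]] = 1
    return labels
```

Note that two boxes of the same category (for example two people) collapse to
a single positive entry, so object counts are also lost in the weak annotation.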

ImageNet Attribute: This dataset contains 9600 images from 384 ImageNet
synsets/categories. We ignore the provided bounding box annotation.
Attributes for each bounding box are labelled as 1 (presence), -1 (absence) or
0 (ambiguous). We use the same 20 of 25 attributes as previous work and
consider 1 and 0 as positive examples. Many of the 384 categories are
subordinate categories, e.g. dog breeds. However, distinguishing fine-grained
subordinate categories is beyond the scope of this study. We are interested in
finding a ‘black-dog’ or ‘white-car’, rather than ‘black-mutt’ or
‘white-ford-focus’. We thus convert the 384 ImageNet categories to 172
entry-level categories (see Fig. 4). We evenly split each class to create the
training and testing sets.


Features: 

We first convert each image $i$ to $N_i$ super-pixels/patches by a recent
segmentation algorithm. We set the segmentation threshold to 0.1 to obtain a
single over-segmentation from the hierarchical segmentation for each image.
Each segmented patch is represented using two types of normalised histogram
features: SIFT and Colour. (1) SIFT: we extract colorSIFT [29] on a regular
grid (every 5 pixels) at four scales. A 256-component GMM is constructed on the
collection of colorSIFT descriptors from all images. We compute Fisher Vector +
PCA for all regular points in each patch. The resulting reduced descriptor is
512-D for every segmented region. (2) Colour: We convert the image to quantised
LAB space 8×8×8. A 512-D colour histogram is then computed for each patch. The
final 1024-D feature vector concatenates SIFT and Colour features together.
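
The colour part of the descriptor can be illustrated with a short sketch. It
assumes the three LAB channels of each pixel have already been rescaled to the
range [0, 1), which is a simplification introduced here; the function names
`colour_histogram` and `patch_feature` are hypothetical.

```python
def colour_histogram(pixels, bins=8):
    """Normalised bins**3 histogram of 3-channel pixels with values in [0, 1)."""
    hist = [0.0] * (bins ** 3)
    for pixel in pixels:
        # Quantise each channel into `bins` levels, clamping values at 1.0.
        q = [min(int(c * bins), bins - 1) for c in pixel]
        hist[(q[0] * bins + q[1]) * bins + q[2]] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def patch_feature(sift_descriptor, colour_hist):
    """Concatenate the two per-patch descriptors into one feature vector."""
    return list(sift_descriptor) + list(colour_hist)
```

With 8 bins per channel the histogram has 8 × 8 × 8 = 512 entries, so
concatenating it with a 512-D SIFT-based descriptor gives the 1024-D patch
feature described above.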

Compared Methods: 

We compare our WS-SIBP to one strongly supervised model and three weakly
supervised alternatives:

Strongly supervised model: A strongly supervised model uses
bounding-box-level annotation. Two variants are considered for the two datasets
respectively. DPM+s-SVM: for aPascal, both object detector and attribute
classifier are trained from fully supervised data (i.e. bounding-box-level
annotation in Fig. 3). Specifically, we use the 20 pre-trained DPM detectors
from [10] and 64 attribute classifiers from prior work. GT+s-SVM: for ImageNet
attributes, there is not enough data to learn 172 strong DPM detectors as in
aPascal. So we use the ground truth bounding box instead, assuming we have
perfect object detectors, giving a significant advantage to this strongly
supervised model. We train attribute classifiers using our features and
liblinear SVM. These strongly supervised models are similar in spirit to the
models used in [17, 36, 35] and can provide a performance upper bound for the
weakly supervised models compared.
w-SVM: In this weakly-supervised baseline, both object detectors and
attribute classifiers are trained on the weak image-level labels as for our
model (see Fig. 3). For aPascal, we train object and attribute classifiers
using feature extraction and model training code provided by the authors of
prior work. For ImageNet, our features are used, without segmentation.
MIML [41]: This is the multi-instance multi-label (MIML) learning method in
[41]. In a way, our model can also be considered as a MIML method with each
image a bag and each patch an instance. The MIML model provides a mechanism
to use the same super-pixel/patch based representation for images as our model,
thus providing the object/attribute localisation capability as our model does.
w-LDA: Weakly-supervised Latent Dirichlet Allocation (LDA) approaches
[31] have been used for object localisation. We implement a generalisation of
LDA [2, 31] that accepts continuous feature vectors (instead of bag-of-words).
Like MIML, this method can also accept patch based representation, but w-LDA
is more related to our WS-SIBP than MIML since it is also a generative model.


4.1 Image annotation with object-attribute association 

An image description can be automatically generated by predicting objects and
their associated attributes. Evaluating the performance of a multi-faceted
framework covering annotation, association and localisation is non-trivial. To
comprehensively cover all aspects of performance of our method and competitors,
we perform three annotation tasks with different amounts of constraint on test
images: (1) free annotation, where no constraint is given to a test image, (2)
annotation given object names, where named but not located objects are known
for each test image, and (3) annotation given locations, where object locations
are given in the form of bounding boxes, within which the attributes can be
predicted.
Free annotation: For WS-SIBP, w-LDA and MIML the procedure in Sec. 3.4
is used to detect objects and then describe them using the top t attributes. For
the strongly supervised model on aPascal (DPM+s-SVM), we use DPM object
detectors to find the most confident objects and their bounding boxes in each
test image. Then we use the 64 attribute classifiers to predict top t attributes
in each bounding box. In contrast, w-SVM trains attributes and objects
independently, and cannot associate objects and attributes. We thus use it to
predict only one attribute vector per image regardless of which object label it
predicts.

Table 1: Free annotation performance evaluated on t attributes per object.

aPascal     w-SVM   MIML   w-LDA   WS-SIBP   DPM+s-SVM
AP@2        24.8    28.7   30.7    38.6      40.6
AP@5        21.2    22.4   24.0    28.9      30.3
AP@8        20.3    21.0   21.5    24.1      23.8

ImageNet    w-SVM   MIML   w-LDA   WS-SIBP   GT+s-SVM
AP@2        46.3    46.6   48.4    58.5      65.9
AP@3        41.1    43.2   43.1    51.8      60.7
AP@4        37.5    38.3   38.4    47.4      53.2

Fig. 5: Qualitative results on free annotation. False positives are shown in red.
If the object prediction is wrong, the corresponding attribute box is shaded.

Since there are a variable number of objects per image in aPascal,
quantitatively evaluating free annotation is not straightforward. Therefore, we
evaluate only the most confident object and its associated top t attributes in
each image, although more could be described. For ImageNet, there is only one
object per image. We follow [11, 39] in evaluating annotation accuracy by
average precision (AP), given varying numbers (t) of predicted attributes per
object. Note that if the predicted object is wrong, all associated attributes
are considered wrong.
Table 1 compares the free annotation performance of the five models. We
have the following observations: (1) Our WS-SIBP, despite being learned with
the weak image-level annotation, yields comparable performance to the strongly
supervised model. The gap is particularly small for the more challenging
aPascal dataset, whilst for ImageNet the gap is bigger, as the strongly
supervised GT+s-SVM has an unfair advantage by using the ground truth bounding
boxes during testing. (2) WS-SIBP consistently outperforms the three weakly
supervised alternatives. The margin is particularly large for t = 2 attributes
per object, which is closest to the true number of attributes per object. For
bigger t, all models must generate some irrelevant attributes, thus narrowing
the gaps. (3) As expected, the w-SVM model obtains the weakest results,
suggesting that the ability to locate objects is important for modelling
object-attribute association. (4) Compared to the two generative models, MIML
has worse performance because a generative model is more capable of utilising
weak labels. (5) Between the two generative models, the advantage of our
WS-SIBP over w-LDA is clear, due to the ability of IBP to explain each patch
with multiple non-competing factors. (Training two independent w-LDA models for
objects and attributes respectively is not a solution: the problem would
re-occur for multiple competing attributes.)

Fig. 6: Illustrating the inferred patch-annotation for MIML, w-LDA and WS-SIBP
(columns: Object, Attribute, O-A). Object and attributes are coloured, and
multi-label annotation blends colours. The bottom two groups each have two rows
corresponding to the two most confident objects detected.
Fig. 5 shows qualitative results on aPascal via the two most confident objects
and their associated attributes. This is challenging data - even the strongly
supervised DPM+s-SVM makes mistakes for both attribute and object prediction.
Compared to the weakly supervised models, WS-SIBP has more accurate
prediction - it jointly and non-competitively models objects and their
attributes so object detection benefits from attribute detection and vice
versa. Other weakly supervised models are also more likely to mismatch
attributes with objects, e.g. MIML detects a shiny person rather than the
correct shiny motorbike.

To gain some insight into what has been learned by our model and why it is
better than the weakly supervised alternatives, Fig. 6 visualises the attribute
and object factors learned by the WS-SIBP model and by the two baselines that
also use patches as input. It is evident that without explicit background
modelling, MIML suffers greatly by trying to explain the background patches
using the weak labels. In contrast, both w-LDA and WS-SIBP have good
segmentation of foreground objects, showing that both the learned foreground
and background topics are meaningful. However, for w-LDA, since object and
attribute topics compete for the same patch, each patch is dominated by either
an object or attribute topic. In contrast, the object factors and attribute
factors co-exist happily in WS-SIBP as they should do, e.g. most person patches
have the clothing attribute as well.

Table 2: Results on annotation given object names (GN) or locations (GL).

                w-SVM   MIML   w-LDA   WS-SIBP   strongly supervised
GN  aPascal     -       32.1   35.5    38.9      41.8
GN  ImageNet    32.4    33.5   39.6    51.5      56.8
GL  aPascal     33.2    35.1   35.8    43.8      42.1
GL  ImageNet    37.7    39.1   46.8    53.7      56.8

Fig. 7: Object-attribute query results as precision-average recall curve.
Panels: (a) O-A, aPascal; (b) O-A, ImageNet; (c) O-A-A, ImageNet.


Annotation given object names (GN): In this experiment, we assume that
object labels are given and we aim to describe each object by attributes,
corresponding to tasks such as: “Describe the car in this image”. For the
strongly supervised model on aPascal, we use the object’s DPM detector to find
the most confident bounding box. Then we predict attributes for that box. Here,
annotation accuracy is the same as attribute accuracy, so the performance of
different models is evaluated following [40] by mean average precision (mAP)
under the precision-recall curve. Note that for aPascal, w-SVM reports the same
list of attributes for all co-existing objects, without being able to localise
and distinguish them. Its result is thus not meaningful and is excluded. The
same set of conclusions can be drawn from Table 2 as in the free annotation
task: our WS-SIBP is on par with the strongly supervised models and outperforms
the weakly supervised ones.


Given object location (GL): If we further know the bounding box of an
object in a test image, we can simply predict attributes inside each bounding
box. This becomes the conventional attribute prediction task [9, 27] for
describing an object. Table 2 shows the results, where similar observations can
be made as in the other two tasks above. Note that in this case the strongly
supervised model is the method used in earlier attribute prediction work. The
mAP obtained using our weakly supervised model is even higher than the strongly
supervised model (though our area-under-ROC-curve value of 81.5 is slightly
lower than the 83.4 figure reported for that method).

Fig. 8: Object-attribute query: qualitative comparison


4.2 Object-attribute query 

In this task object-attribute association is used for image retrieval. Following
work on multi-attribute queries, we use mean average recall over all precisions
(MAR) as the evaluation metric. Note that unlike [26], which requires each
queried combination to have enough (100) training examples to train conjunction
classifiers, our method can query novel never-previously-seen combinations.
Three experiments are conducted. We generate 300 random object-attribute
combinations for aPascal and ImageNet respectively and 300
object-attribute-attribute queries for ImageNet. For the strongly supervised
model, we normalise and multiply object detector with attribute classifier
scores. No object detector is trained for ImageNet so no result is reported
there. For w-SVM, we use [30] to calibrate the SVM scores for objects and
attributes as in [26]. For the three WS models, the procedure in Sec. 3.4 is
used to compute the retrieval ranking.
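
The normalise-and-multiply fusion used for the strongly supervised baseline can
be sketched as follows. The text does not state which normalisation is applied,
so min-max scaling is an assumption made here, and the function names `min_max`
and `fuse_and_rank` are hypothetical.

```python
def min_max(scores):
    """Rescale a list of scores to [0, 1]; constant lists map to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_and_rank(object_scores, attribute_scores):
    """Rank candidates by the product of normalised object and attribute scores.

    Both lists hold one score per candidate (for example per image or per
    detected box), in the same order.
    """
    fused = [o * a for o, a in zip(min_max(object_scores), min_max(attribute_scores))]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
```

Multiplication rewards candidates that score well on both the object and the
attribute, so a candidate with a very confident object but an absent attribute
is ranked below one with moderate evidence for both.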

Quantitative results are shown in Fig. 7 and some qualitative examples in
Fig. 8. Our WS-SIBP has very similar MAR values to the strongly supervised
DPM+s-SVM, while outperforming all the other models. w-SVM calibration [30]
helps it outperform MIML and w-LDA. However, the lack of object-attribute
association and background modelling still causes problems for w-SVM. This
is illustrated in the ‘dog-black-white’ example shown in Fig. 8, where a white
background caused an image with a black dog to be retrieved at rank 2 by w-SVM.


5 Conclusion 

We have presented an effective model for weakly-supervised learning of objects, 
attributes, their location and associations. Learning object-attribute association 
from weak supervision is non-trivial but critical for learning from ‘natural’ data, 
and scaling to many classes and attributes. We achieve this for the first time 
through a novel weakly-supervised stacked IBP model that simultaneously
disambiguates patch-annotation correspondence, as well as learning the
appearance of each annotation. Our results show that our model performs
comparably with a strongly supervised alternative that is significantly more
costly to supervise.